# hamilton-help
g
Hey All, Is there a way to define a more abstract/generic type hint to enable easy swapping of objects? For example:

DAG:

```python
from pathlib import Path
from typing import List, Union

from langchain_community.document_loaders.pdf import BasePDFLoader
from langchain_core.documents import Document

def pdf_pages(pdf_path: Union[str, Path],
              pdf_loader: BasePDFLoader) -> List[Document]:
    return pdf_loader(pdf_path).load()
```

Driver:

```python
from langchain_community.document_loaders import PyMuPDFLoader

results = dr.execute(final_vars=outputs, inputs={"pdf_loader": PyMuPDFLoader})
```

PyMuPDFLoader inherits from BasePDFLoader. I'm getting the following error:

Error: Type requirement mismatch. Expected pdf_loader:typing.Type[langchain_community.document_loaders.pdf.BasePDFLoader] got <class 'langchain_community.document_loaders.pdf.PyMuPDFLoader'>:<class 'abc.ABCMeta'> instead.

My motivation comes from wanting to check the DAG with different LangChain PDF loaders. Thanks!
t
Hi Gilad! The code behavior you want should work! In the specific snippet you shared, I believe there's an error in the Driver code. This

```python
..., inputs={"pdf_loader": PyMuPDFLoader})
```

should be (it's missing the `()` for object instantiation):

```python
..., inputs={"pdf_loader": PyMuPDFLoader()})
```
g
Hey Thierry šŸ™‚ In this specific case, PyMuPDFLoader needs to be initialized with the pdf_path itself (it's a weird implementation by LangChain), so it actually should be passed in like I did, as the class `PyMuPDFLoader`.
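i.e., the intended usage is roughly this (the path here is hypothetical):

```python
from langchain_community.document_loaders import PyMuPDFLoader

# the path goes to the constructor; .load() then returns the documents
pages = PyMuPDFLoader("some/file.pdf").load()
```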
šŸ‘ 1
t
Here's some more detail on classes and subclasses for type hints, taking scikit-learn as an example. The following code is valid:

```python
from sklearn.base import BaseEstimator
from sklearn.linear_model import LinearRegression

# returns specific
def trained_model() -> LinearRegression:
    return LinearRegression()

# expects generic, which includes the specific
def prediction(trained_model: BaseEstimator):
    return ...
```
While this isn't:

```python
from sklearn.base import BaseEstimator
from sklearn.linear_model import LinearRegression

# returns generic
def trained_model() -> BaseEstimator:
    return LinearRegression()

# expects specific, which may conflict with the generic
def prediction(trained_model: LinearRegression):
    return ...
```
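The same intuition as a runnable check (the class-vs-instance distinction is also what your error message is about):

```python
from sklearn.base import BaseEstimator
from sklearn.linear_model import LinearRegression

assert issubclass(LinearRegression, BaseEstimator)    # the specific class subclasses the generic
assert isinstance(LinearRegression(), BaseEstimator)  # so an *instance* satisfies the generic hint
assert isinstance(LinearRegression, type)             # the class itself, however, is a type, not an instance
```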
Ok! That's an interesting API šŸ˜… Do you mind sharing the code of the node `pdf_loader`?
g
```python
from langchain_community.document_loaders import PyMuPDFLoader

pdf_loader = PyMuPDFLoader
```

https://api.python.langchain.com/en/latest/_modules/langchain_community/document_loaders/pdf.html#PyMuPDFLoader
t
Ok, but I'm guessing you can't do `inputs={"pdf_loader": PyMuPDFLoader(pdf_path=...)}` because the `pdf_path` is injected in the Hamilton DAG (i.e., it comes from another node)?
g
Hm............. my intuition tells me I'd prefer to have the path in the DAG, but I'm not sure.... šŸ™‚
t
Error: Type requirement mismatch. Expected pdf_loader:typing.Type[langchain_community.document_loaders.pdf.BasePDFLoader] got <class 'langchain_community.document_loaders.pdf.PyMuPDFLoader'>:<class 'abc.ABCMeta'> instead.

This error comes from `Driver.execute()`. It tells me that you have a valid DAG with `pdf_loader: langchain_community.document_loaders.pdf.BasePDFLoader`, but there's a mismatch with the `pdf_loader` value passed in inputs.
Would it make sense to have a function/node to create this object?

```python
import abc

from langchain_community.document_loaders.pdf import BasePDFLoader

def pdf_loader(pdf_type: abc.ABCMeta, pdf_loader_config: dict) -> BasePDFLoader:
    return pdf_type(**pdf_loader_config)  # this creates a `BasePDFLoader`

def my_other_func(pdf_loader: BasePDFLoader):  # ... plus whatever other inputs you need
    return ...
```

Driver code:

```python
dr.execute(..., inputs={"pdf_type": PyMuPDFLoader, "pdf_loader_config": {"pdf_path": ...}})
```
If the LangChain API doesn't provide the necessary flexibility to create the object in the Driver code, I would opt for this option. You don't lose visibility IMO, and you can add a docstring explaining what the node does and justifying your design decision (i.e., why it's not in the Driver code).
g
I don't like having `pdf_loader_config`. I would prefer to have `pdf_path` explicitly in the inputs. I'm currently using `Any`, which is not optimal, but works.

```python
from pathlib import Path
from typing import Any, List, Union

from langchain_core.documents import Document

def pdf_pages(pdf_path: Union[str, Path],
              pdf_loader: Any) -> List[Document]:
    return pdf_loader(pdf_path).load()
```
šŸ‘ 1
or maybe I'll take your implementation and add `pdf_path` and `additional_loader_config` šŸ™‚
šŸ‘ 1
t
That's another valid option! Maybe it's worth documenting (via the type hint or docstring) that `pdf_loader` is a type rather than an object?
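For example, something like this (just a sketch of your `Any` version with a docstring added):

```python
from pathlib import Path
from typing import Any, List, Union

from langchain_core.documents import Document

def pdf_pages(pdf_path: Union[str, Path], pdf_loader: Any) -> List[Document]:
    """Return the pages of the PDF at `pdf_path`.

    `pdf_loader` is a loader *class* (e.g. `PyMuPDFLoader`), not an instance;
    it gets instantiated with `pdf_path` inside this node.
    """
    return pdf_loader(pdf_path).load()
```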
šŸ‘ 1
g
Yeah, you're right. Thanks a lot Thierry šŸ™‚
t
Each person has their own approach, but if you have a main config, one approach I like for ML models is to do:

```python
from typing import Optional

from sklearn.base import BaseEstimator
from sklearn.linear_model import LinearRegression

def model_config(model_config_overrides: Optional[dict] = None) -> dict:
    config = dict(...)  # your default config goes here
    if model_config_overrides:
        config.update(**model_config_overrides)

    return config

def base_model(model_config: dict) -> BaseEstimator:
    return LinearRegression(**model_config)

def trained_model(base_model: BaseEstimator, ...) -> BaseEstimator:
    # fit model
    return base_model
```
This has the following benefits:
• the most common config lives directly in my code (`model_config()`) and I don't have to specify a config to use `dr.execute()`
• if I want to change a few values at the same time, I can use `model_config_overrides`. This will be very explicit from `dr.execute()`, and I can inspect the return value of `model_config()` if I need to debug
• for full control or testing, I can use `dr.execute(..., overrides={"model_config": dict()})` to explicitly pass a config and skip potential bugs in `model_config()`

It might be a bit overengineered, but you might find it useful if you do a lot of experimentation! example: https://github.com/zilto/ordinal-forecasting-digital-phenotyping/blob/master/src/tabular_model.py#L164
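Concretely, those three modes look like this from the Driver side (the `fit_intercept`/`positive` values are just hypothetical `LinearRegression` kwargs):

```python
# 1. rely on the defaults baked into model_config()
dr.execute(final_vars=["trained_model"])

# 2. tweak a few values explicitly via the optional input
dr.execute(final_vars=["trained_model"],
           inputs={"model_config_overrides": {"fit_intercept": False}})

# 3. full control / testing: bypass model_config() entirely
dr.execute(final_vars=["trained_model"],
           overrides={"model_config": {"fit_intercept": False, "positive": True}})
```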
g
Thanks! It helps a lot. Since I like things to be configurable and swappable - I would prefer not to "hardcode" the LinearRegression part but to pass it to the DAG as input and define it outside of the DAG so that I can easily test and compare different models. Does that make sense to you?
t
Yup! So that would depend on the level of flexibility.
• If you have a known set of models to include, you can use `@config.when()` (e.g., swap between LinearRegression, SVM, HistGradientBoostingRegressor); see the sketch after the code below.
• If you want even more flexibility / dynamism, you can write
```python
from typing import Optional

from sklearn.base import BaseEstimator
from sklearn.linear_model import LinearRegression

def model_config(model_config_overrides: Optional[dict] = None) -> dict:
    config = dict(...)  # your default config goes here
    if model_config_overrides:
        config.update(**model_config_overrides)

    return config

# you can set a default value if you want
def base_model(model_config: dict, model_type: type = LinearRegression) -> BaseEstimator:
    return model_type(**model_config)

def trained_model(base_model: BaseEstimator, ...) -> BaseEstimator:
    # fit model
    return base_model
```
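(For the first bullet, the `@config.when()` route could look like the sketch below; the `model_type` config key and the SVR alternative are just my example.)

```python
from hamilton.function_modifiers import config
from sklearn.base import BaseEstimator
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR

@config.when(model_type="linear")
def base_model__linear(model_config: dict) -> BaseEstimator:
    return LinearRegression(**model_config)

@config.when(model_type="svm")
def base_model__svm(model_config: dict) -> BaseEstimator:
    return SVR(**model_config)
```

You then pick the implementation via the config you pass when building the Driver (e.g., `{"model_type": "svm"}`).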
With the second option's flexibility though, you probably want to check that the `model_config` will be valid for the `model_type`
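e.g., a minimal validation sketch of my own, using `inspect`:

```python
import inspect

from sklearn.base import BaseEstimator
from sklearn.linear_model import LinearRegression

def base_model(model_config: dict, model_type: type = LinearRegression) -> BaseEstimator:
    # sanity check: every config key must be an __init__ parameter of model_type
    valid_params = set(inspect.signature(model_type.__init__).parameters) - {"self"}
    unknown = set(model_config) - valid_params
    if unknown:
        raise ValueError(f"Invalid config keys for {model_type.__name__}: {unknown}")
    return model_type(**model_config)
```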
šŸ‘ 1
g
Looks good, thanks! Can you explain this last sentence: "t`ig` will be valid for the `model_type`"?
t
updated the comment, accidentally deleted some text
šŸ‘ 1